On reducing load/store latencies of cache accesses
Abstract
Effective address calculation for load and store instructions must compete with other instructions for the ALU, which can add latency to data cache accesses. Fast address generation is an approach proposed to reduce cache access latencies. This paper presents a fast address generator that eliminates most effective address computations. Experimental results show that this fast address generator reduces the effective address computations of load and store instructions by about 74% on average for the SPECint2000 benchmarks and cuts execution time by 8.5%. Further improvement is possible if the data of previous load operations are also buffered in the unused data field of LSQ entries. The average execution-time reduction grows to 10.5% when the default LSQ is modified into this cached LSQ design.
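The idea of buffering load data in otherwise unused LSQ data fields can be illustrated with a toy model. The Python sketch below is not the paper's design: the names `LSQEntry` and `CachedLSQ`, the queue capacity, the retire-oldest policy, and the dictionary standing in for the data cache are all assumptions made for illustration. It only shows how a later load to the same address might be served from an earlier LSQ entry instead of the cache.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LSQEntry:
    is_store: bool
    addr: int
    # Stores always carry data; in this sketch loads also keep
    # the value they fetched (the "cached" part of the model).
    data: Optional[int] = None


class CachedLSQ:
    """Toy load/store queue that reuses data held in its own entries."""

    def __init__(self, memory, capacity=8):
        self.memory = memory          # dict standing in for the data cache
        self.capacity = capacity      # hypothetical queue size
        self.entries = []             # oldest entry first
        self.cache_accesses = 0
        self.lsq_hits = 0

    def _push(self, entry):
        if len(self.entries) == self.capacity:
            oldest = self.entries.pop(0)
            if oldest.is_store:
                # A store writes memory when it leaves the queue.
                self.memory[oldest.addr] = oldest.data
        self.entries.append(entry)

    def store(self, addr, value):
        self._push(LSQEntry(True, addr, value))

    def load(self, addr):
        # Search youngest to oldest: the youngest matching entry
        # holds the most recent value for this address.
        for entry in reversed(self.entries):
            if entry.addr == addr and entry.data is not None:
                self.lsq_hits += 1
                value = entry.data
                break
        else:
            self.cache_accesses += 1
            value = self.memory.get(addr, 0)
        self._push(LSQEntry(False, addr, value))
        return value
```

In this model a repeated load to the same address counts as an LSQ hit rather than a cache access, which is the kind of saving the abstract attributes to the cached LSQ. Real hardware would also have to handle partial overlaps, access sizes, and coherence, none of which are modeled here.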
Similar resources
Improving Memory Access Performance Using a Code Coalescing Unit
High clock frequencies combined with deep pipelining employed by many of the state-of-the-art processors have forced cache hit accesses to be multi-cycle operations. For many programs, untolerated load latencies account for a significant portion of total execution time. In this paper, we present a mechanism called the Code Coalescing Unit (CCU) that can identify and eliminate at run-time several...
Duplicating and Deconstructing Virtual Load/Store Queues
Virtual load/store queues (VLSQs) within existing physical load/store queues (LSQs) have been proposed as an effective mechanism for reducing energy losses and increasing performance. The VLSQ restricts reordering of memory operations by limiting the number of memory instructions visible to the issue logic. This decreases the amount of execution time wasted in replay traps and leads to ...
Reducing the LSQ and L1 Data Cache Power Consumption
In most modern processor designs, the hardware dedicated to storing data and instructions (the memory hierarchy) has become a major consumer of power. To reduce this power consumption, we propose in this paper two techniques, one to filter accesses to the LSQ (Load-Store Queue) based on both timing and address information, and the other to filter accesses to the first level data cache based on a f...
Design and Evaluation of a Switch
Cache coherent non-uniform memory access (CC-NUMA) multiprocessors provide a scalable design for shared memory but they continue to suffer from large remote memory access latencies due to comparatively slow memory technology and data transfer latencies in the interconnection network. In this paper, we propose a novel hardware caching technique, called switch cache, to improve the remote memory...
Design and Evaluation of a Switch Cache Architecture for CC-NUMA Multiprocessors
Cache coherent nonuniform memory access (CC-NUMA) multiprocessors provide a scalable design for shared memory. But, they continue to suffer from large remote memory access latencies due to comparatively slow memory technology and large data transfer latencies in the interconnection network. In this paper, we propose a novel hardware caching technique, called switch cache, to improve the remote...
Journal title:
- Journal of Systems Architecture - Embedded Systems Design
Volume 56
Publication date: 2010